This report provides the analysis of RNA-Seq data from iron-depletion experiments in Giardia WB. 2 conditions, 1 control and three replicates for each. Analysis was done with edgeR.
The new Giardia WB genome was used.STAR v020201 was used to map the concatenated reference genomes. ReadsPerGene.out.tab output from STAR with specifying --quantMode GeneCounts.
TYDK is the control, NOFERRIC is TYDK without ferric (less iron), 50uM is TYDK without ferric and supplemented with 50 µM iron inhibitor (even less iron).
Read and merge files containing gene expression counts.
Read the pre-compiled annotation_tab, and add the annotation to the DGEList. There is no Name for the Giardia annotation.
Only genes with reasonable expression level should be included in the analysis. Keep genes with at least 2 count per million in at least 3 samples (replicates).
## keep
## FALSE TRUE
## 415 4976
Plot number of reads matched per sample, and total for each gene across all samples. The high peak in the counts per genes are the ribosomal RNAs.
The raw library sizes were normalized with a scaling factor for each sample that aims to minimize the fold changes between the samples for the majority of the genes. Ideally all scaling factors should be around 1. TMM the default method was used.
We can see the lib.size are quite equivalent, and normalization factors are also around 1.
Below we can compare the different libraries before and after normalizations. Relative Log Expression (RLE) plots are powerful tool for visalizaing unwanted variation in high dimensional data. Ideally all boxplots should be centered around 0 and be of similar width. Compare the boxplots before and after normaliztion, we can see the normalization worked well for this data.
The MDS plot shows the unsupervised clustering of the samples. It’s a two-dimensional scatterplot that distances on the plot approximate the typical log2 fold changes between the samples based the top 500 most variable genes between each pair of samples. On this plot, we can see that samples from the same hours do cluster together. But batch effect can not be excluded, which will be discussed later.
The square root of the common dispersion gives the coefficient of variation of biological variation (BCV). The BCV indicates how the abundance of each gene varies between replicate samples. The common BCV for the data set:
## [1] 0.004928227
The number is low and good. A plot of the averge \(\log_2\)CPM versus the BCV for each gene is shown below. The plot visualises how gene variation between replicate samples (BCV) changes as gene abundance (\(\log_2\) counts per million) increase. We can see the trended dispersion shows a decreasing trend with expression level. Only the trended dispersion is used under the quasi-likelihood (QL) pipeline. ## Dispersion estimation The QL dispersions can be estimated using a genewise negative binomial generalized linear models with Quasi-likelihood test (glmQLFit), and then be visualized with the plotQLDisp function. The big sample size variation cause the QL dispersions are squeezed very heavily from the raw values. The data is transformed to \(\log_2\)CPM.
QL F-test (glmQLFTest) were used to determine significant differential expression. The QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. It also provides more robust and reliable error rate control when the number of replicates is small. ## Contrasts Comparisons of interest listed below. Contrast defines the null hypothesis as the comparison intersted to be equal to zero.
The test results are visualized in te following smear plot. Genes that are significantly DE with an FDR of 5% are highlighted in red and blue.
Heatmap of the moderated log-counts-per-million of the DE genes in this comparison.
A volcano plot the fold change on the x-axis and the statistial significance on the y-axis (the -log10 of the p-value. Genes that are highly dysregulated are farther to the left and right sides, while highly significant changes appear higher on the plot. Genes with FDR<0.05 are colored red, and abs(logFC)>1 are colored orange, and green if both.
The table lists all the DE with abs(logFC) > 1 and FDR< 0.05.
Inspect the depth-adjusted reads per million for the top differentially expressed.
The test results are visualized in te following smear plot. Genes that are significantly DE with an FDR of 5% are highlighted in red and blue.
Heatmap of the moderated log-counts-per-million of the DE genes in this comparison.
A volcano plot the fold change on the x-axis and the statistial significance on the y-axis (the -log10 of the p-value. Genes that are highly dysregulated are farther to the left and right sides, while highly significant changes appear higher on the plot. Genes with FDR<0.05 are colored red, and abs(logFC)>1 are colored orange, and green if both.
The table lists all the DE with abs(logFC) > 1 and FDR< 0.05.
Inspect the depth-adjusted reads per million for the top differentially expressed.
The test results are visualized in te following smear plot. Genes that are significantly DE with an FDR of 5% are highlighted in red and blue.
Heatmap of the moderated log-counts-per-million of the DE genes in this comparison.
A volcano plot the fold change on the x-axis and the statistial significance on the y-axis (the -log10 of the p-value. Genes that are highly dysregulated are farther to the left and right sides, while highly significant changes appear higher on the plot. Genes with FDR<0.05 are colored red, and abs(logFC)>1 are colored orange, and green if both.
The table lists all the DE with abs(logFC) > 1 and FDR< 0.05.
Inspect the depth-adjusted reads per million for the top differentially expressed.
Genes that are DE are the ones with adjusted p-value (same as FDR in the table) less than a defined false discovery (FDR) rate. The adjusted p-value is used as opposed to the initial p-value as it has been adjusted for multiple testing. The FDR cutoff is set at 0.05. So genes with adjusted p-value < 0.05 is considered as DE, which were further classified into up- or down- regulated. Up-regulated genes has a positve log-fold change, while down-regulated genes has a negative log-fold change. DE with abs(logFC) > 1 was also show below.
Heatmap of the logFC of the DE genes as well as heatmap of the top 500 genes.
Venn Diagram of the three most interested comparisons with multiple DE. Note the numbers do not agree between both venn diagrams. The second diagram shows only those genes that were either up in both samples or down in both samples. The first diagram also includes genes that were up in one sample and down in the other, which is a less restrictive criterion. And the total numbers in each contrast do add up the same.